Tage #443

Open
wants to merge 73 commits into
base: dev

Conversation

@ABenC377 ABenC377 commented Dec 9, 2024

Adding a TAGE branch predictor.

Performance relative to previous best (Perceptron) summarised below:

| Benchmark | BP_update time | BP_update mispredict | TAGE time | TAGE mispredict | Performance change (percentage) | Mispredict change (raw) |
|---|---|---|---|---|---|---|
| CloverLeaf serial gcc8.3.0 armv8.4 | 11715 | 10.70% | 10511 | 2.06% | -10.28% ✅ | -8.64% ✅ |
| CloverLeaf serial gcc9.3.0 armv8.4 | 11830 | 10.40% | 10321 | 1.85% | -12.76% ✅ | -8.55% ✅ |
| CloverLeaf serial gcc10.3.0 armv8.4 | 12457 | 12.70% | 10550 | 1.74% | -15.31% ✅ | -10.96% ✅ |
| CloverLeaf serial armclang20 armv8.4 | 11041 | 14.50% | 9069 | 1.85% | -17.86% ✅ | -12.65% ✅ |
| CloverLeaf openmp gcc8.3.0 armv8.4 | 15635 | 9.78% | 14530 | 2.00% | -7.07% ✅ | -7.78% ✅ |
| CloverLeaf openmp gcc9.3.0 armv8.4 | 15559 | 7.82% | 14680 | 1.64% | -5.65% ✅ | -6.18% ✅ |
| CloverLeaf openmp gcc10.3.0 armv8.4 | 15291 | 9.33% | 13872 | 1.56% | -9.28% ✅ | -7.77% ✅ |
| CloverLeaf openmp armclang20 armv8.4 | 13939 | 13.30% | 12273 | 1.74% | -11.95% ✅ | -11.56% ✅ |
| miniBUDE openmp gcc8.3.0 armv8.4 | 21292 | 7.37% | 20613 | 5.71% | -3.19% ✅ | -1.66% ✅ |
| miniBUDE openmp gcc9.3.0 armv8.4 | 20652 | 7.43% | 20364 | 5.71% | -1.39% ✅ | -1.72% ✅ |
| miniBUDE openmp gcc10.3.0 armv8.4 | 20810 | 7.41% | 20691 | 5.71% | -0.57% ✅ | -1.70% ✅ |
| miniBUDE openmp armclang20 armv8.4 | 19365 | 10.30% | 19808 | 5.05% | +2.29% ❌ | -5.25% ✅ |
| STREAM serial gcc8.3.0 armv8.4 | 6619 | 0.51% | 6672 | 0.42% | +0.80% ❌ | -0.09% ✅ |
| STREAM serial gcc9.3.0 armv8.4 | 6612 | 0.50% | 6778 | 0.42% | +2.51% ❌ | -0.08% ✅ |
| STREAM serial gcc10.3.0 armv8.4 | 6583 | 0.44% | 6649 | 0.42% | +1.00% ❌ | -0.03% ✅ |
| STREAM serial armclang20 armv8.4 | 7018 | 1.01% | 7071 | 0.75% | +0.76% ❌ | -0.26% ✅ |
| STREAM openmp gcc8.3.0 armv8.4 | 10200 | 1.90% | 10066 | 0.67% | -1.31% ✅ | -1.23% ✅ |
| STREAM openmp gcc9.3.0 armv8.4 | 10037 | 2.08% | 9859 | 0.50% | -1.77% ✅ | -1.58% ✅ |
| STREAM openmp gcc10.3.0 armv8.4 | 9814 | 1.81% | 9799 | 0.63% | -0.15% ✅ | -1.18% ✅ |
| STREAM openmp armclang20 armv8.4 | 10292 | 3.04% | 10459 | 1.21% | +1.62% ❌ | -1.83% ✅ |
| TeaLeaf 2D serial gcc8.3.0 armv8.4 | 11874 | 15.20% | 10871 | 1.04% | -8.45% ✅ | -14.16% ✅ |
| TeaLeaf 2D serial gcc9.3.0 armv8.4 | 11846 | 15.20% | 10847 | 1.05% | -8.43% ✅ | -14.15% ✅ |
| TeaLeaf 2D serial gcc10.3.0 armv8.4 | 12020 | 15.20% | 11078 | 1.05% | -7.84% ✅ | -14.15% ✅ |
| TeaLeaf 2D serial armclang20 armv8.4 | 21353 | 9.16% | 20597 | 3.27% | -3.54% ✅ | -5.89% ✅ |
| TeaLeaf 2D openmp gcc8.3.0 armv8.4 | 17221 | 7.05% | 16118 | 1.18% | -6.40% ✅ | -5.87% ✅ |
| TeaLeaf 2D openmp gcc9.3.0 armv8.4 | 17224 | 7.75% | 16485 | 1.43% | -4.29% ✅ | -6.32% ✅ |
| TeaLeaf 2D openmp gcc10.3.0 armv8.4 | 16595 | 6.70% | 16221 | 0.85% | -2.25% ✅ | -5.85% ✅ |
| TeaLeaf 2D openmp armclang20 armv8.4 | 52356 | 9.29% | 50683 | 2.38% | -3.20% ✅ | -6.91% ✅ |
| TeaLeaf 3D serial gcc8.3.0 armv8.4 | 13645 | 8.73% | 13203 | 1.15% | -3.24% ✅ | -7.58% ✅ |
| TeaLeaf 3D serial gcc9.3.0 armv8.4 | 14157 | 10.70% | 13463 | 1.57% | -4.90% ✅ | -9.13% ✅ |
| TeaLeaf 3D serial gcc10.3.0 armv8.4 | 14331 | 11.00% | 13167 | 1.50% | -8.12% ✅ | -9.50% ✅ |
| TeaLeaf 3D serial armclang20 armv8.4 | 19199 | 22.40% | 16675 | 1.69% | -13.15% ✅ | -20.71% ✅ |
| TeaLeaf 3D openmp gcc8.3.0 armv8.4 | 22251 | 7.62% | 21775 | 2.30% | -2.14% ✅ | -5.32% ✅ |
| TeaLeaf 3D openmp gcc9.3.0 armv8.4 | 22774 | 9.03% | 21750 | 1.47% | -4.50% ✅ | -7.56% ✅ |
| TeaLeaf 3D openmp gcc10.3.0 armv8.4 | 22229 | 8.33% | 20867 | 0.99% | -6.13% ✅ | -7.34% ✅ |
| TeaLeaf 3D openmp armclang20 armv8.4 | 40910 | 16.90% | 37148 | 1.30% | -9.20% ✅ | -15.60% ✅ |
| CloverLeaf serial gcc8.3.0 armv8.4+sve | 11137 | 10.90% | 9820 | 2.31% | -11.83% ✅ | -8.59% ✅ |
| CloverLeaf serial gcc9.3.0 armv8.4+sve | 11051 | 9.98% | 9909 | 1.86% | -10.33% ✅ | -8.12% ✅ |
| CloverLeaf serial gcc10.3.0 armv8.4+sve | 11140 | 12.80% | 9462 | 1.80% | -15.06% ✅ | -11.00% ✅ |
| CloverLeaf serial armclang20 armv8.4+sve | 11051 | 13.30% | 9280 | 1.50% | -16.03% ✅ | -11.80% ✅ |
| CloverLeaf openmp gcc8.3.0 armv8.4+sve | 14845 | 9.65% | 13076 | 1.87% | -11.92% ✅ | -7.78% ✅ |
| CloverLeaf openmp gcc9.3.0 armv8.4+sve | 15310 | 7.96% | 13208 | 1.80% | -13.73% ✅ | -6.16% ✅ |
| CloverLeaf openmp gcc10.3.0 armv8.4+sve | 14754 | 9.62% | 13397 | 1.56% | -9.20% ✅ | -8.06% ✅ |
| CloverLeaf openmp armclang20 armv8.4+sve | 14309 | 12.20% | 12493 | 1.29% | -12.69% ✅ | -10.91% ✅ |
| miniBUDE openmp gcc8.3.0 armv8.4+sve | 8707 | 14.20% | 7917 | 3.61% | -9.07% ✅ | -10.59% ✅ |
| miniBUDE openmp gcc9.3.0 armv8.4+sve | 8440 | 7.43% | 7656 | 3.54% | -9.29% ✅ | -3.89% ✅ |
| miniBUDE openmp gcc10.3.0 armv8.4+sve | 8458 | 7.41% | 7695 | 3.41% | -9.02% ✅ | -4.00% ✅ |
| miniBUDE openmp armclang20 armv8.4+sve | 8655 | 20.30% | 7961 | 1.21% | -8.02% ✅ | -19.09% ✅ |
| STREAM serial gcc8.3.0 armv8.4+sve | 3521 | 1.51% | 3572 | 1.28% | +1.45% ❌ | -0.23% ✅ |
| STREAM serial gcc9.3.0 armv8.4+sve | 3534 | 1.50% | 3980 | 1.28% | +12.62% ❌ | -0.22% ✅ |
| STREAM serial gcc10.3.0 armv8.4+sve | 3481 | 1.44% | 3787 | 1.28% | +8.79% ❌ | -0.16% ✅ |
| STREAM serial armclang20 armv8.4+sve | 2207 | 2.01% | 2268 | 1.46% | +2.76% ❌ | -0.55% ✅ |
| STREAM openmp gcc8.3.0 armv8.4+sve | 6821 | 3.90% | 6740 | 1.79% | -1.19% ✅ | -2.11% ✅ |
| STREAM openmp gcc9.3.0 armv8.4+sve | 7198 | 3.08% | 6686 | 1.01% | -7.11% ✅ | -2.07% ✅ |
| STREAM openmp gcc10.3.0 armv8.4+sve | 7067 | 2.81% | 6646 | 1.18% | -5.96% ✅ | -1.63% ✅ |
| STREAM openmp armclang20 armv8.4+sve | 6428 | 3.04% | 5867 | 1.44% | -8.73% ✅ | -1.60% ✅ |
| TeaLeaf 2D serial gcc8.3.0 armv8.4+sve | 11725 | 15.20% | 10942 | 1.04% | -6.68% ✅ | -14.16% ✅ |
| TeaLeaf 2D serial gcc9.3.0 armv8.4+sve | 11584 | 15.20% | 10899 | 1.04% | -5.91% ✅ | -14.16% ✅ |
| TeaLeaf 2D serial gcc10.3.0 armv8.4+sve | 11879 | 15.20% | 11034 | 1.05% | -7.11% ✅ | -14.15% ✅ |
| TeaLeaf 2D serial armclang20 armv8.4+sve | 11137 | 9.16% | 7370 | 0.87% | -33.82% ✅ | -8.29% ✅ |
| TeaLeaf 2D openmp gcc8.3.0 armv8.4+sve | 17041 | 7.05% | 16349 | 1.18% | -4.06% ✅ | -5.87% ✅ |
| TeaLeaf 2D openmp gcc9.3.0 armv8.4+sve | 17451 | 7.75% | 16532 | 1.43% | -5.27% ✅ | -6.32% ✅ |
| TeaLeaf 2D openmp gcc10.3.0 armv8.4+sve | 16463 | 6.70% | 16187 | 0.85% | -1.68% ✅ | -5.85% ✅ |
| TeaLeaf 2D openmp armclang20 armv8.4+sve | 52701 | 9.29% | 49203 | 1.63% | -6.64% ✅ | -7.66% ✅ |
| TeaLeaf 3D serial gcc8.3.0 armv8.4+sve | 12169 | 18.73% | 10629 | 1.60% | -12.66% ✅ | -17.13% ✅ |
| TeaLeaf 3D serial gcc9.3.0 armv8.4+sve | 12183 | 18.70% | 10639 | 1.07% | -12.67% ✅ | -17.63% ✅ |
| TeaLeaf 3D serial gcc10.3.0 armv8.4+sve | 12405 | 18.30% | 10616 | 1.58% | -14.42% ✅ | -16.72% ✅ |
| TeaLeaf 3D serial armclang20 armv8.4+sve | 19363 | 22.40% | 15654 | 1.42% | -19.16% ✅ | -20.98% ✅ |
| TeaLeaf 3D openmp gcc8.3.0 armv8.4+sve | 21676 | 7.62% | 18948 | 2.22% | -12.59% ✅ | -5.40% ✅ |
| TeaLeaf 3D openmp gcc9.3.0 armv8.4+sve | 20728 | 9.03% | 18989 | 1.51% | -8.39% ✅ | -7.52% ✅ |
| TeaLeaf 3D openmp gcc10.3.0 armv8.4+sve | 20438 | 8.33% | 18652 | 0.85% | -8.74% ✅ | -7.48% ✅ |
| TeaLeaf 3D openmp armclang20 armv8.4+sve | 41040 | 8.49% | 37791 | 0.88% | -7.92% ✅ | -7.61% ✅ |
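
The two change columns appear to be computed differently: the performance change looks like a relative change in the time figure, while the mispredict change looks like a plain difference of the two rates in percentage points. The formulas below are inferred from the numbers rather than stated in the PR, and the function names are made up for illustration. They reproduce the first CloverLeaf row.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Relative change in the time figure, in percent (assumed formula).
// Negative means the TAGE run was faster than the baseline run.
double timeChangePercent(uint64_t baselineTime, uint64_t tageTime) {
  assert(baselineTime != 0);
  const double base = static_cast<double>(baselineTime);
  return 100.0 * (static_cast<double>(tageTime) - base) / base;
}

// Absolute change in mispredict rate, in percentage points (assumed formula).
// Both arguments are rates already expressed in percent.
double mispredictChangePoints(double baselineRate, double tageRate) {
  return tageRate - baselineRate;
}
```

For the first row, `timeChangePercent(11715, 10511)` is about -10.28 and `mispredictChangePoints(10.70, 2.06)` is -8.64, so a "-8.64%" in the last column means 8.64 fewer mispredicts per 100 branches, not an 8.64% relative reduction.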

@ABenC377 ABenC377 marked this pull request as ready for review December 10, 2024 13:06

@FinnWilkinson FinnWilkinson left a comment

Generally good, a few minor points

docs/sphinx/user/configuring_simeng.rst (outdated, resolved)
src/include/simeng/branchpredictors/BranchHistory.hh (outdated, resolved)
src/include/simeng/branchpredictors/BranchHistory.hh (outdated, resolved)
src/include/simeng/branchpredictors/TagePredictor.hh (outdated, resolved)
@@ -149,13 +149,13 @@ The Branch-Prediction section contains those options to parameterise the branch
The current options include:

Type
-The type of branch predictor that is used, the options are ``Generic``, and ``Perceptron``. Both types of predictor use a branch target buffer with each entry containing a direction prediction mechanism and a target address. The direction predictor used in ``Generic`` is a saturating counter, and in ``Perceptron`` it is a perceptron.
+The type of branch predictor that is used, the options are ``Generic``, ``Perceptron``, and ``Tage``. Each of these types of predictor use prediction tables with each entry containing a direction prediction mechanism and a target address. The direction predictor used in ``Generic`` and ``TAGE`` is a saturating counter, and in ``Perceptron`` it is a perceptron. ``TAGE`` also uses a series of further, tagged prediction tables to provide predictions informed by greater branch histories.
Contributor

is there a good reason behind using Tage and TAGE?

Contributor Author

There is not. I've updated to Tage throughout, as this is the capitalisation used in the config YAML.

Contributor

Seems like the creator uses all forms of capitalisation

@@ -29,10 +29,15 @@ Queue-Sizes:
Load: 40
Store: 24
Branch-Predictor:
Contributor

Some TX2 diagrams note its use of a multi-history branch predictor. I assume this is TAGE-like, so maybe apply this config update to the TX2 YAML as well?

Contributor Author

Yes, that sounds like it would be. I've updated the TX2 config as well.

for (uint32_t i = 0; i < numTageTables_; i++) {
std::vector<TageEntry> newTable;
for (uint32_t j = 0; j < (1ul << tageTableBits_); j++) {
TageEntry newEntry = {2, 0, 1, 0};
Contributor

Would we not want to initialise the TageEntry with a SatCnt equal to the one used in the btb_?

Contributor Author

Yep, good catch

// global history (folded onto itself to make it of the correct size).
uint64_t h1 = (address >> 2);
uint64_t h2 = globalHistory_.getFolded(1ull << (table + 1), tageTableBits_);
// Then truncat the XOR to make it fit thed esired size of an index
Contributor

*the desired

// global history (folded onto itself to make it of the correct size).
uint64_t h1 = (address >> 2);
uint64_t h2 = globalHistory_.getFolded(1ull << (table + 1), tageTableBits_);
// Then truncat the XOR to make it fit thed esired size of an index
Contributor

*truncate
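
For readers following the hunk above, here is a minimal sketch of the indexing scheme it describes: the address with its two low bits dropped is XORed with a folded slice of the global history, then masked to the table's index width. The real `BranchHistory::getFolded` is not shown in this thread, so `foldHistory` is a hypothetical stand-in that works on a single 64-bit word, and `tageIndex` is an illustrative name. Only the `address >> 2`, the `1 << (table + 1)` history length, and the final truncation come from the diff.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for getFolded: XOR-folds the `length` most recent
// outcomes (the low bits of `history`) down to `bits` bits.
uint64_t foldHistory(uint64_t history, unsigned length, unsigned bits) {
  assert(bits > 0 && bits < 64 && length <= 64);
  uint64_t recent =
      (length == 64) ? history : (history & ((1ull << length) - 1));
  const uint64_t mask = (1ull << bits) - 1;
  uint64_t folded = 0;
  while (recent != 0) {
    folded ^= recent & mask;  // fold in the next chunk of `bits` outcomes
    recent >>= bits;
  }
  return folded;
}

// Index into tagged table `table`, which sees 2^(table + 1) history bits.
// The single-word history limits this sketch to table <= 5.
uint64_t tageIndex(uint64_t address, uint64_t history, unsigned table,
                   unsigned tableBits) {
  assert(table <= 5);
  const uint64_t h1 = address >> 2;  // drop the two alignment bits
  const uint64_t h2 = foldHistory(history, 1u << (table + 1), tableBits);
  // Truncate the XOR so that it fits the desired size of an index.
  return (h1 ^ h2) & ((1ull << tableBits) - 1);
}
```

With this sketch, table 0 hashes 2 history bits, table 1 hashes 4, and so on, which gives the geometric series of history lengths that TAGE takes its name from.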

FinnWilkinson previously approved these changes Dec 16, 2024
@FinnWilkinson
Contributor

Could you also add tage to the a64fx_SME.yaml config?

FinnWilkinson previously approved these changes Dec 18, 2024

@dANW34V3R dANW34V3R left a comment

Very clean PR. Nicely and precisely commented, and everything is very easily readable. I like the branch history class, which is also well explained.

if (i == 0) {
history_[i] |= ((isTaken) ? 1 : 0);
} else {
history_[i] |= (((history_[i - 1] & (1ull << 63)) > 0) ? 1 : 0);
Contributor

Does this need the conditional statement? After doing the AND you could shift right by 63 to get your 0 or 1. Would be slightly fewer cycles and more understandable/readable in my eyes (you may disagree)

Contributor Author

@ABenC377 ABenC377 Dec 30, 2024

I think the conditional is needed here. What's being loaded into the uint64 depends on where it is in the vector. All but the least-significant uint64s get the MSB of the next uint64 added as the LSB. But the least-significant uint64 gets isTaken added as the LSB. However, if I'm misunderstanding your question, let me know.
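
One possible reading of the two comments above is that they concern different conditionals: the `if (i == 0)` on the word position has to stay, while the inner ternary that extracts the carry bit could become a shift. The sketch below shows that combination. It assumes word 0 holds the most recent outcomes and that words are processed from the top down so each carry is read before its source word is shifted; the actual loop in `BranchHistory.hh` is not visible here, and `shiftedIn` is a made-up name that returns a copy for ease of checking.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Shifts a multi-word history left by one bit and inserts the new outcome.
// Assumption: history[0] is the least-significant (most recent) word.
std::vector<uint64_t> shiftedIn(std::vector<uint64_t> history, bool isTaken) {
  for (std::size_t i = history.size(); i-- > 0;) {
    history[i] <<= 1;
    if (i == 0) {
      // The lowest word takes the new branch outcome as its LSB.
      history[i] |= static_cast<uint64_t>(isTaken);
    } else {
      // Every other word takes the MSB of the word below it, which has not
      // been shifted yet. The shift yields 0 or 1 without a ternary.
      history[i] |= history[i - 1] >> 63;
    }
  }
  return history;
}
```

For example, starting from `{0x8000000000000000, 0}` and shifting in a taken branch gives `{1, 1}`: the old MSB of word 0 carries into word 1, and the new outcome lands in bit 0 of word 0.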

* outcome, 'position' would be 0.
* */
void updateHistory(bool isTaken, uint64_t position) {
if (position < size_) {
Contributor

Should we assert position being < size_ as above, or are there cases where this could "validly" be greater? For instance, if you are trying to update an entry that has been lost from the history because there have been too many branches in the meantime?

Contributor Author

Exactly as you say, I don't think that this should be an assert as the core may validly try to update a history that is no longer being tracked. The reason that we should allow this is to allow the pipeline not to need to know the size of the branch history. We're already ensuring that this doesn't cause problems with our if statement on 82.
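
A small sketch of the behaviour being defended here: an update aimed at a position that has already aged out of the history is dropped silently rather than tripping an assert. This is a single-word illustration with a hypothetical free function, not the `BranchHistory` member itself. It takes position 0 to be the most recent outcome, as in the doc comment in the hunk, and returns the new history instead of mutating state so the effect is easy to check.

```cpp
#include <cassert>
#include <cstdint>

// Returns `history` with the outcome recorded `position` branches ago
// overwritten by `isTaken`. `size` is the number of outcomes tracked.
uint64_t withCorrectedOutcome(uint64_t history, uint64_t size, bool isTaken,
                              uint64_t position) {
  assert(size <= 64);  // limit of this single-word sketch only
  if (position >= size) {
    // The outcome is no longer tracked. The caller does not need to know
    // the history length, so this is treated as a valid no-op.
    return history;
  }
  const uint64_t mask = 1ull << position;
  return isTaken ? (history | mask) : (history & ~mask);
}
```

With an 8-entry history of `0x5`, correcting position 1 to taken yields `0x7`, while a correction at position 12 leaves the history at `0x5`.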

* access and manipulate large branch histories, as are needed in
* sophisticated branch predictors.
*
* The bits of the branch history are stored in a vector of uint64_t values,
Contributor

"vector" should be "array"

std::vector<std::pair<uint8_t, uint64_t>> btb_;

/** The bitlength of the Tagged tables' indices.
* Each tagged table with have 2^bits entries. */
Contributor

With -> will


uint64_t TagePredictor::getTag(uint64_t address, uint8_t table) {
// Hash function here is pretty arbitrary
uint64_t h1 = address;
Contributor

Any reason not to remove the 2 LSBs here?

Contributor Author

Yes. Ideally, two different branches should never produce the same value from both the tag hash and the index hash. Because the index hash removes the 2 LSBs, keeping them here makes the inputs to the two hashes different, which improves the accuracy of the BP by reducing the risk of this kind of accidental clash.
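
To make the argument concrete, here is a toy pair of hashes in which the index drops the two low address bits and the tag keeps them. Both functions are simplified assumptions: the real `getTag` is described as "pretty arbitrary" and only its `h1 = address` line is shown, and the folded history is passed in directly here. Because the address lines up against the history bits differently in the two hashes, two branches that alias in the index can still get different tags.

```cpp
#include <cassert>
#include <cstdint>

// Toy index hash: address without its 2 LSBs, XORed with folded history.
uint64_t indexHash(uint64_t address, uint64_t foldedHistory, unsigned bits) {
  assert(bits > 0 && bits < 64);
  return ((address >> 2) ^ foldedHistory) & ((1ull << bits) - 1);
}

// Toy tag hash: the full address, 2 LSBs included, XORed with folded history.
uint64_t tagHash(uint64_t address, uint64_t foldedHistory, unsigned bits) {
  assert(bits > 0 && bits < 64);
  return (address ^ foldedHistory) & ((1ull << bits) - 1);
}
```

Take a branch at `0x1000` with folded history `0x3` and another at `0x100C` with folded history `0x0`, using 4-bit hashes. Both map to index 3, but their tags are `0x3` and `0xC`, so the second branch does not falsely hit the first one's entry. If the tag also shifted the address right by two, both tags would be `0x3` in this toy setup.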

Only needed for a ``Tage`` predictor. The number of tagged tables used by the predictor, in addition to a default prediction table (i.e., the BTB). Therefore, a value of 3 for ``Num-Tage-Tables`` would result in four total prediction tables: one BTB and three tagged tables. If no tagged tables are desired, it is recommended to use the ``GenericPredictor`` instead.

Tage-Length
Only needed for a ``Tage`` predictor. The number of bits used to tage the entries of the tagged tables.
Contributor

Is the "tage" in the latter sentence meant to be that, or rather "tag"?

Labels
enhancement New feature or request performance Performance optimisation
Projects
Status: Changes Requested
4 participants